Simple supervised document geolocation with geodesic grids
Abstract
We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.
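As a rough illustration of the setup described in the abstract, the sketch below maps coordinates to cells of a uniform latitude/longitude grid and computes a median prediction error in kilometers. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the 1-degree default resolution, the use of the cell center as the predicted location, and the haversine formula with a mean Earth radius of 6371 km are all choices made here for illustration.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def cell_for(lat, lon, degrees_per_cell=1.0):
    """Map a coordinate to the (row, col) index of a uniform lat/lon grid cell."""
    n_rows = int(round(180.0 / degrees_per_cell))
    n_cols = int(round(360.0 / degrees_per_cell))
    row = int(math.floor((lat + 90.0) / degrees_per_cell))
    col = int(math.floor((lon + 180.0) / degrees_per_cell))
    # Clamp so that lat = 90 and lon = 180 fall into the last cell.
    return min(row, n_rows - 1), min(col, n_cols - 1)


def cell_center(cell, degrees_per_cell=1.0):
    """Return the (lat, lon) center of a cell (assumed here to be the prediction)."""
    row, col = cell
    lat = -90.0 + (row + 0.5) * degrees_per_cell
    lon = -180.0 + (col + 0.5) * degrees_per_cell
    return lat, lon


def great_circle_km(a, b):
    """Haversine great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def median_error_km(predicted_cells, true_coords, degrees_per_cell=1.0):
    """Median distance between predicted cell centers and true coordinates."""
    errors = [
        great_circle_km(cell_center(cell, degrees_per_cell), truth)
        for cell, truth in zip(predicted_cells, true_coords)
    ]
    return median(errors)
```

Any classifier that outputs a most probable cell per document could be evaluated with `median_error_km`. A finer grid lowers the best achievable error but leaves fewer training documents per cell, which is one reason to compare several resolutions.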
Similar resources
Kernel Density Estimation for Text-Based Geolocation
Text-based geolocation classifiers often operate with a grid-based view of the world. Predicting document location of origin based on text content on a geodesic grid is computationally attractive since many standard methods for supervised document classification carry over unchanged to geolocation in the form of predicting a most probable grid cell for a document. However, the grid-based approa...
Hierarchical Discriminative Classification for Text-Based Geolocation
Text-based document geolocation is commonly rooted in language-based information retrieval techniques over geodesic grids. These methods ignore the natural hierarchy of cells in such grids and fall afoul of independence assumptions. We demonstrate the effectiveness of using logistic regression models on a hierarchy of nodes in the grid, which improves upon the state of the art accuracy by sever...
Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. One common approach for geolocating texts is rooted in information retrieval. Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed ...
Text Classification Based On Manifold Semi-Supervised Support Vector Machine
This article presents a solution, along with experimental results, for applying semi-supervised machine learning techniques and an improved SVM (Support Vector Machine) based on a geodesic model to build text classification applications for the Vietnamese language. The objective here is to improve the semi-supervised machine learning by replacing the kernel function of SVM using geodesi...
Sparse Geodesic Paths
In this paper we propose a new distance metric for signals that admit a sparse representation in a known basis or dictionary. The metric is derived as the length of the sparse geodesic path between two points, by which we mean the shortest path between the points that is itself sparse. We show that the distance can be computed via a simple formula and that the entire geodesic path can be easily...